Outline


-  Our  domain  of  research. 

-  The  mathematics  of  strategically  complex  game  playing 

-  Move  evaluation, 

-  Min/max  depth  search, 

-  Temporal  difference  learning. 

-  Application  to  our  network  checkers  game. 

-  The  HARD  PROBLEM:  hidden  spatial  states. 

-  Information  theoretic  advisors, 

-  Combine  with  TD(0) 

-  Open  research  questions. 


Our  domain  of  research 


-  Broadly described as medium resolution war-gaming.

-  Maneuver  of  forces,  hidden  forces, 
ambiguously  defined  end-states. 

-  Use  machine  learning  techniques  to  develop 
strategies. 

-  Also  use  machine  learning  to  capture  and 
generalize  expertise  of  human  players 

-  Monte  Carlo  simulation  to  develop  risk  analysis. 

-  Currently  looking  at  chess/checkers  variants 

-  Networked  agents,  Hidden  agents. 


The  Mathematics  of  Strategically  Complex  Gaming 


-  Often  classical  game  theory  won't  work,  notably  because  of 
the  “tyranny  of  dimension.” 

-  Too  many  states  and  strategic  paths  over  time. 

-  Solving  tabular  stochastic  dynamic  games  is  impossible. 

-  Checkers (a game at the lower end of the complexity scale)

-  Has maneuver, materiel diversity, complex strategy

-  An enormous number of possible strategic paths over the course of a game (a large power of 10).

-  Move  to  the  theatre  chess  game  (operational  war-game). 

-  Gaming can still be viewed as approximating Bellman's state-action equation (for the optimal sequence of actions at each state and time $s_t, s_{t+1}, \ldots, s_T$) with various methods.


Evaluation 


-  Bellman's equation for the optimal policy $\pi$ satisfies

$Q(s_t, a_t) = E\left[ R(s_t, a_t) + \max_{a_{t+1}} Q(s_{t+1}, a_{t+1}) \right]$


-  The function $Q(s, a)$ is called the action-value function, specifying the total future reward of taking action $a_t$ from state $s_t$. Games like chess/checkers have no intermediate reward, only a terminal reward.

-  Game playing: approximate the action-value function by a function of a linear sum of weighted features.

$Q(s, a) \approx f\left( \sum_i w_i \phi_i(s) \right)$

-  The weights $w_i$ are called the “advisors” to the “features” $\phi_i(s)$ of the state (e.g. force balance).


Min-Max  depth  search. 


Essentially you “project into the future” to a limited horizon. The optimal action is chosen to be

$\max_{a} \{ \min ( \max ( \cdots ( \min Q(s'''', a'''') ) ) ) \}$

The best way to explain this is as a tree search, with the values at the leaves “backed up” to the root node through mini-max search.


-  Limited by the branching factor of the moves (checkers roughly 8, chess 36, Go 80).


[Figure: mini-max game tree with alternating levels of your moves (max) and opponent's moves (min); leaf values are backed up to the root.]
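The tree search described above can be sketched as a depth-limited recursion. This is a minimal illustration only, not the presenters' implementation: the callables `moves` and `evaluate`, and the toy tree used to exercise them, are hypothetical names introduced here.

```python
from typing import Callable, Hashable, List


def minimax(state: Hashable, depth: int, maximizing: bool,
            moves: Callable[[Hashable, bool], List[Hashable]],
            evaluate: Callable[[Hashable], float]) -> float:
    """Depth-limited mini-max: leaf values are backed up to the root."""
    successors = moves(state, maximizing)
    if depth == 0 or not successors:
        # Search horizon or terminal position: use the evaluation function.
        return evaluate(state)
    values = [minimax(s, depth - 1, not maximizing, moves, evaluate)
              for s in successors]
    return max(values) if maximizing else min(values)


def best_successor(state, depth, moves, evaluate):
    """Pick the successor state with the highest backed-up value."""
    return max(moves(state, True),
               key=lambda s: minimax(s, depth - 1, False, moves, evaluate))


# Hypothetical two-ply tree: our move, then the opponent's reply.
TOY_TREE = {"root": ["a", "b"], "a": ["a1", "a2"], "b": ["b1", "b2"]}
TOY_LEAVES = {"a1": 3, "a2": -5, "b1": 1, "b2": 2}


def toy_moves(state, maximizing):
    return TOY_TREE.get(state, [])


def toy_evaluate(state):
    return TOY_LEAVES.get(state, 0)
```

In the toy tree, move `a` looks attractive because of the leaf worth 3, but the opponent would answer with the leaf worth -5, so the backed-up value favours move `b`.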


Temporal  difference  learning. 


-  Recalling Bellman’s equation, at the optimal policy with no intermediate rewards the expected difference vanishes:

$E^{*}\left[ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \mid s_t, a_t \right] = 0$

-  In temporal difference learning you learn by “shifting” the past approximation $Q(s_t, a_t)$ towards the future approximation $Q(s_{t+1}, a_{t+1})$.

-  This  incrementally  minimises  the  “temporal 
difference”  as  is  required  by  Bellman. 

-  Done by setting (if the states can be enumerated)

$Q_{new}(s_t, a_t) = \alpha\, Q_{old}(s_{t+1}, a_{t+1}) + (1 - \alpha)\, Q_{old}(s_t, a_t)$.

-  The learning rate $\alpha$ must satisfy some simple stochastic convergence conditions.
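The tabular update above can be written in a few lines. This is a sketch under the assumption that state-action pairs are hashable keys in a dictionary; the function name `td_update` and the example keys are illustrative, not from the talk.

```python
from collections import defaultdict


def td_update(q, state_action, next_state_action, alpha):
    """Shift Q(s_t, a_t) towards Q(s_{t+1}, a_{t+1}) by a fraction alpha."""
    q[state_action] = (alpha * q[next_state_action]
                       + (1 - alpha) * q[state_action])
    return q[state_action]


# Hypothetical example: the successor pair is already valued at 1.0
# (say, a known win), the current pair starts at the default 0.0.
q_table = defaultdict(float)
q_table[("s1", "a1")] = 1.0
updated = td_update(q_table, ("s0", "a0"), ("s1", "a1"), alpha=0.5)
```

With a learning rate of 0.5, the earlier estimate moves halfway towards the later one, so `updated` is 0.5 here.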


Temporal  Difference  in  Game  Playing. 


-  Too many states to solve, so do TD learning on the advisors $w_i$ so we can generalize to novel states.

-  Here we use gradient descent, incrementing the vector of advisors by

$\Delta w = \alpha \left( Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right) \nabla_w Q(s_t, a_t)$.

-  Changes  in  advisor  weights  can  be  done  on-line 
(as  the  game  is  played)  or  off-line  (at  the  end  of 
each  game). 

-  Here $Q(s_t, a_t) = \Pr(\text{winning game} \mid s_t, a_t)$.

-  Equivalently, we get a terminal reward of 1 for a win and 0 for a loss.


Function  Approximation 


-  We approximate this probability by

$Q(s, a) = 1 / \left( 1 + \exp\left( -\sum_i w_i \phi_i(s) \right) \right)$


-  Terminal state

$Q(s_T, \cdot) = \begin{cases} 1 & \text{for a win,} \\ 1/2 & \text{for a draw,} \\ 0 & \text{for a loss.} \end{cases}$


-  The advisors usually reflect the importance of balance in pieces, mobility, and other features of the game.


-  $w_1 (N_1 - N_2)$ + other terms

-  $N_i$: number of pieces of side $i$
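The logistic evaluation and the TD gradient step on the advisor weights can be combined in a short sketch. It assumes the features are supplied as a plain list of numbers (for example the piece balance $N_1 - N_2$ followed by a mobility term). The names `q_value` and `td_weight_step` are hypothetical, and `target` stands for either $Q(s_{t+1}, a_{t+1})$ or the terminal value.

```python
import math


def q_value(weights, features):
    """Logistic squashing of the weighted feature sum: estimated Pr(win)."""
    z = sum(w * f for w, f in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))


def td_weight_step(weights, features, target, alpha):
    """One TD step: w += alpha * (target - Q) * dQ/dw.

    For the logistic form, dQ/dw_i = Q * (1 - Q) * feature_i.
    """
    q = q_value(weights, features)
    scale = alpha * (target - q) * q * (1.0 - q)
    return [w + scale * f for w, f in zip(weights, features)]


# Hypothetical position: two pieces up, one unit of mobility advantage,
# and the game was eventually won (terminal target 1.0).
start_weights = [0.0, 0.0]
position_features = [2.0, 1.0]
new_weights = td_weight_step(start_weights, position_features,
                             target=1.0, alpha=1.0)
```

Starting from zero weights the estimate is 0.5; after one step towards a win both advisors increase, with the larger feature receiving the larger increment.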




An  example  in  an  imperfect  information  game 


-  Network  checkers 

-  Network  vulnerability  through  dynamic  games. 

-  Pieces  connected  by  a  network  of  varying 
topology. 

-  Only pieces in the largest connected sub-graph exhibit mobility; isolated pieces don’t.

-  Each  side  aware  of  materiel  and  the  largest 
sub-graph  size. 

-  Network details are hidden (degree distribution, etc.).
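The mobility rule can be illustrated with a breadth-first search for the largest connected sub-graph. This is a sketch under assumptions not stated in the slides: pieces are identified by hashable ids, links are undirected pairs, and ties between equally large sub-graphs go to the first one found.

```python
from collections import deque


def largest_component(pieces, links):
    """Return the set of pieces in the largest connected sub-graph."""
    adjacency = {p: set() for p in pieces}
    for u, v in links:
        # Links to captured (absent) pieces are ignored.
        if u in adjacency and v in adjacency:
            adjacency[u].add(v)
            adjacency[v].add(u)
    seen, best = set(), set()
    for start in adjacency:
        if start in seen:
            continue
        component, queue = {start}, deque([start])
        while queue:
            node = queue.popleft()
            for neighbour in adjacency[node]:
                if neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical side with five pieces: a chain 1-2-3 and a pair 4-5.
links = [(1, 2), (2, 3), (4, 5)]
mobile_before = largest_component([1, 2, 3, 4, 5], links)
mobile_after = largest_component([1, 3, 4, 5], links)  # piece 2 captured
```

Capturing piece 2 splits the chain, so mobility shifts from pieces 1, 2, 3 to the pair 4, 5. This is the kind of network vulnerability the game is meant to expose.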


[Figure: network in which all pieces are in the largest connected sub-graph.]




Example


[Figure: example network showing the largest connected sub-graph.]


Results of Learning Advisor Weights


[Figure: advisor weights plotted against number of games.]


Hidden  Spatial  States 


-  We have begun research into hidden spatial states.

-  Difficult for the following reasons:

-  Vastly increased branching factor of possible states. For example, suppose we have one invisible piece with positive probability of being in each of $j$ squares:

New branching factor = Old branching factor × $j$

-  Have  to  construct  a  distribution  of  opponent’s 
probable  states.  Pr(  opponent  state  is  s)  >  0 

-  Pruning of the estimated states is risky (the opponent could exploit this).


Our  approach 


-  Start  with  a  small  number  of  invisible  pieces. 

-  Use the theorem of total probability and conditioning on events to develop a Markov chain for the probability of hidden pieces being in some state. Generate $\Pr(\text{board}_1), \ldots, \Pr(\text{board}_n)$.

-  Really  a  non-Markov  problem  (opponent's  strategies  will  be 
history  dependent). 

-  We know when pieces are taken, including hidden pieces.

-  If you run into a hidden piece you lose a turn (but gain information on the hidden piece's location).

-  If you try to take a hidden piece and it's not there you also lose a turn, but gain information on its location.


Estimation  example:  null  move  by  opponent. 


$\Pr_{new}(\text{hidden at } (3,3)) = \tfrac{1}{2} \Pr_{old}(\text{hidden at } (4,4))$
$\Pr_{new}(\text{hidden at } (3,5)) = \tfrac{1}{2} \Pr_{old}(\text{hidden at } (4,4))$
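The null-move example is an application of the theorem of total probability: the new probability of a square is the sum, over squares the hidden piece could have come from, of the old probability times the chance of that move. The sketch below assumes the piece is equally likely to take any of its available moves and ignores board edges and occupied squares. The names `diffuse` and `forward_diagonals` are hypothetical.

```python
def diffuse(belief, moves_from):
    """Propagate a belief over squares through one hidden move."""
    new_belief = {}
    for square, prob in belief.items():
        targets = moves_from(square)
        if not targets:
            # No move available: the probability mass stays put.
            new_belief[square] = new_belief.get(square, 0.0) + prob
            continue
        share = prob / len(targets)  # assumed uniform over available moves
        for target in targets:
            new_belief[target] = new_belief.get(target, 0.0) + share
    return new_belief


def forward_diagonals(square):
    """Two forward diagonal moves of a checkers man (edges ignored)."""
    row, col = square
    return [(row - 1, col - 1), (row - 1, col + 1)]


belief_after_null_move = diffuse({(4, 4): 1.0}, forward_diagonals)
```

With the hidden piece known to be at (4,4), the result puts probability 1/2 on each of (3,3) and (3,5), matching the two equations above.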


Estimation  example:  reconnaissance  move  by  self. 


$\Pr_{new}(\text{hidden at } x) = \Pr_{old}(\text{hidden at } x \mid \text{not at } (4,2))$
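Conditioning on a failed reconnaissance amounts to removing the probed square and renormalising the rest. A minimal sketch, assuming the belief is stored as a dictionary from squares to probabilities (the function name and example numbers are illustrative):

```python
def condition_not_at(belief, probed_square):
    """Bayes update after learning the hidden piece is NOT at probed_square."""
    remaining = {sq: p for sq, p in belief.items() if sq != probed_square}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("belief placed all probability on the probed square")
    return {sq: p / total for sq, p in remaining.items()}


# Hypothetical prior: half the mass on (4,2), the rest split evenly.
prior = {(4, 2): 0.5, (3, 3): 0.25, (3, 5): 0.25}
posterior = condition_not_at(prior, (4, 2))
```

Here the posterior puts probability 1/2 on each of the two remaining squares.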


Information  theoretic  advisor 


-  Opponent’s  movement  of  hidden  pieces  increases 
uncertainty  of  state. 

-  Reconnaissance  moves  decrease  uncertainty. 

-  Entropy is the best way to model this.

-  If the opponent’s states have probabilities $p_1, p_2, \ldots, p_n$,

-  then the entropy is $h = -\sum_i p_i \log_2 p_i$.

-  We  incorporate  the  value  of  reconnaissance  moves 
through  a  term  in  the  evaluation  function  that  takes 
into  account  the  entropy. 
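The entropy of a belief over opponent states is straightforward to compute. The sketch below uses base-2 logarithms as in the formula above and skips zero-probability states. How the entropy enters the evaluation function (for instance as one more feature with its own advisor weight) is left open here, since the slides do not specify it.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy h = -sum p_i log2 p_i, ignoring zero-probability states."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# Hypothetical beliefs: four equally likely squares before a successful
# reconnaissance, two equally likely squares afterwards.
uncertainty_before = entropy_bits([0.25, 0.25, 0.25, 0.25])
uncertainty_after = entropy_bits([0.5, 0.5])
```

Halving the number of equally likely squares removes exactly one bit of uncertainty, which is the quantity a reconnaissance move is buying.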


Evaluation  of  Moves 


-  Have  to  calculate  the  expectation  over  the  possible 
board  states.  Consider  moves  that  are  legal  for  a 
particular  opposition  board  state  (with  probability 
>0)  or  a  reconnaissance  move. 

-  Reconnaissance move: find out where the opponent is or isn’t.

-  Intend  to  use  temporal  difference  learning  to  find 
the  advisor  weights. 

-  Deep search is nearly impossible, since we have to carry on the same estimation/evaluation cycle.
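Taking the expectation over possible board states can be sketched as a probability-weighted sum of per-board evaluations. Everything named below (`expected_value`, `choose_move`, the board labels and the value table) is hypothetical and only illustrates the calculation; legality checks and reconnaissance moves are omitted.

```python
def expected_value(board_probs, evaluate, move):
    """Average the evaluation of a move over the possible opponent boards."""
    return sum(p * evaluate(board, move)
               for board, p in board_probs.items() if p > 0)


def choose_move(board_probs, evaluate, candidate_moves):
    """Pick the candidate with the highest expected evaluation."""
    return max(candidate_moves,
               key=lambda m: expected_value(board_probs, evaluate, m))


# Hypothetical evaluations: "advance" is excellent against board A but
# poor against board B; "hold" is mediocre against both.
VALUES = {("A", "advance"): 1.0, ("B", "advance"): 0.0,
          ("A", "hold"): 0.5, ("B", "hold"): 0.5}


def toy_board_evaluate(board, move):
    return VALUES[(board, move)]


likely_a = choose_move({"A": 0.75, "B": 0.25}, toy_board_evaluate,
                       ["advance", "hold"])
likely_b = choose_move({"A": 0.25, "B": 0.75}, toy_board_evaluate,
                       ["advance", "hold"])
```

The preferred move flips from "advance" to "hold" as the belief shifts towards board B, which is why the quality of the state estimate matters as much as the evaluation itself.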


Current  research  questions 


-  If  you  don't  see  an  opponent  move 

-  Was it a reconnaissance move or a hidden move?

-  Have  to  learn  this  through  Bayesian  methods. 

-  Want  to  look  at  entropy  balance 

-  We therefore have to estimate our opponent's estimate of our state probabilities.

-  Pruning 

-  What  happens  when  we  prune  boards  with  extremely  low 
probability? 

-  Are there fast and frugal heuristics that generate strategies equal to or better than the computationally expensive way?


Questions?


